Analogical reasoning depends fundamentally on the ability to 
learn and generalize about relations between objects. We develop an 
approach to relational learning which, given a set of pairs of objects 
$S = \{A^{(1)}:B^{(1)}, A^{(2)}:B^{(2)}, \ldots, A^{(N)}:B^{(N)}\}$, measures how well other 
pairs A:B fit in with the set S. Our work addresses the following 
question: is the relation between objects A and B analogous to those 
relations found in S? Such questions are particularly relevant in in- 
formation retrieval, where an investigator might want to search for 
analogous pairs of objects that match the query set of interest. There 
are many ways in which objects can be related, making the task of 
measuring analogies very challenging. Our approach combines a sim- 
ilarity measure on function spaces with Bayesian analysis to produce 
a ranking. It requires data containing features of the objects of inter- 
est and a link matrix specifying which relationships exist; no further 
attributes of such relationships are necessary. We illustrate the po- 
tential of our method on text analysis and information networks. An 
application on discovering functional interactions between pairs of 
proteins is discussed in detail, where we show that our approach can 
work in practice even if a small set of protein pairs is provided. 

1. Contribution. Many university admission exams, such as the Ameri- 
can Scholastic Assessment Test (SAT) and Graduate Record Exam (GRE), 
have historically included a section on analogical reasoning. A prototypical 
analogical reasoning question is as follows: 

doctor : hospital: 

(A) sports fan : stadium 

(B) cow : farm 

(C) professor : college 

(D) criminal : jail 

(E) food : grocery store 

The examinee has to answer which of the five pairs best matches the 
relation implicit in doctor : hospital. Although all candidate pairs have 
some type of relation, pair professor : college seems to best fit the notion 
of (profession, place of work), or the "works in" relation implicit between 
doctor and hospital. 

This problem is nontrivial because measuring the similarity between ob- 
jects directly is not an appropriate way of discovering analogies, as exten- 
sively discussed in the cognitive science literature. For instance, the analogy 
between an electron spinning around the nucleus of an atom and a planet or- 
biting around the Sun is not justified by isolated, nonrelational, comparisons 
of an electron to a planet, and of an atomic nucleus to the Sun [Gentner 
(1983)]. Discovering the underlying relationship between the elements of 
each pair is key in determining analogies. 

1.1. Applications. This paper concerns practical problems of data anal- 
ysis where analogies, implicitly or not, play a role. One of our motivations 
comes from the bioPIXIE project [Myers et al. (2005)]. bioPIXIE is a tool 
for exploratory analysis of protein-protein interactions. Proteins have multi- 
ple functional roles in the cell, for example, regulating metabolism and regu- 
lating cell cycle, among others. A protein often assumes different functional 
roles while interacting with different proteins. When a molecular biologist 
experimentally observes an interaction between two proteins, for example, a 
binding event of $\{P_i, P_j\}$, it might not be clear which function that 
particular interaction is contributing to. The bioPIXIE system allows a molecular 
biologist to input a set S of proteins that are believed to have a particu- 
lar functional role in common, and generates a list of other proteins that 
are deduced to play the same role. Evidence for such predictions is pro- 
vided by a variety of sources, such as the expression levels for the genes 
that encode the proteins of interest and their cellular localization. Another 
important source of information bioPIXIE takes advantage of is a matrix 
of relationships, indicating which proteins interact according to some bio- 
logical criterion. However, we do not necessarily know which interactions 
correspond to which functional roles. 

The application to protein interaction networks that we develop in Section 
5 shares some of the features and motivations of bioPIXIE. However, we aim 
at providing more detailed information. Our input set S is a small set of pairs 
of proteins that are postulated to all play a common role, and we want to 
rank other pairs $P_i : P_j$ according to how similar they are with respect to 
S. The goal is to automatically return pairs that correspond to analogous 
interactions. 

To use an analogy itself to explain our procedure, recall the SAT example 
that opened this section. The pair of words doctor : hospital presented in 
the SAT question plays the role of a protein-protein interaction and is the 
smallest possible case of S, that is, a single pair. The five choices A-E in 
the SAT question correspond to other observed protein-protein interactions 
we want to match with S, that is, other possible pairs. Since multiple valid 
answers are possible, we rank them according to a similarity metric. In the 
application to protein interactions, in Section 5, we perform thousands of 
queries and we evaluate the goodness of the resulting rankings according to 
multiple gold standards, widely accepted by molecular and cellular biologists 
[Ashburner et al. (2000); Kanehisa and Goto (2000); Mewes et al. (2004)]. 

The general problem of interest in this paper is a practical problem of 
information retrieval [Manning, Raghavan and Schütze (2008)] for exploratory 
data analysis: given a query set S of linked pairs, which other pairs of objects 
in my relational database are linked in a similar way? We apply this analysis 
to cases where it is not known how to explicitly describe the different classes 
of relations, but good models to predict the existence of relationships are 
available. In Section 4 we consider an application to information retrieval in 
text documents for illustrative purposes. Given a set of pairs of web pages 
which are related by some hyperlink, we would like to find other pairs of 
pages that are linked in a similar way. In information network settings, the 
proposed method could be useful, for instance, to answer queries for en- 
cyclopedia pages relating scientists and their major discoveries, to search 
for analogous concepts, or to identify the absence of analogous concepts, 
in Wikipedia. From an evaluation perspective, this application domain pro- 
vides an example where large scale evaluation is more straightforward than 
in the biological setting. 

In this paper we introduce a method for ranking relations based on the 
Bayesian similarity criterion underlying Bayesian sets, a method originally 
proposed by Ghahramani and Heller (2005) and reviewed in Section 2. In 
contrast to Bayesian sets, however, our method is tailored to drawing analo- 
gies between pairs of objects. We also provide supplementary material with 
a Java implementation of our method, and instructions on how to rebuild 
the experiments [Silva et al. (2010)]. 

1.2. Related work. To give an idea of the type of data which our method 
is useful for analyzing, consider the methods of Turney and Littman (2005) 
for automatically solving SAT problems. Their analysis is based on a large 
corpus of documents extracted from the World Wide Web. Relations 
between two words $W_i$ and $W_j$ are characterized by their joint co-occurrence 
with other relevant words (such as particular prepositions) within a small 
window of text. This defines a set of features for each $W_i : W_j$ relationship, 
which can then be compared to other pairs of words using some notion of 
similarity. Unlike in this application, however, there are often no (or very 
few) explicit features for the relationships of interest. Instead we need a 
method for defining similarities using features of the objects in each rela- 
tionship, while at the same time avoiding the mistake of directly comparing 
objects instead of relations. 

One of the earliest approaches for determining analogical similarity was 
introduced by Rumelhart and Abrahamson (1973). In their paper, one is 
initially given a set of pairwise distances between objects (say, by the sub- 
jective judgement of a group of people). Such distances are used to embed 
the given objects in a latent space via a multidimensional scaling approach. 
A related pair A : B is then represented as a vector connecting A and B in 
the latent space. Its similarity with respect to another pair C : D is defined 
by comparing the direction and magnitude of the corresponding vectors. 
Our approach is probabilistic instead of geometrical, and operates directly 
on the object features instead of pairwise distances. 

We will focus solely on ranking pairwise relations. The idea can be ex- 
tended to more complex relations, but we will not pursue this here. Our 
approach is described in detail in Section 3. 

Finally, the probabilistic, geometrical and logical approaches applied to 
analogical reasoning problems can be seen as a type of relational data anal- 
ysis [Dzeroski and Lavrac (2001); Getoor and Taskar (2007)]. In particular, 
analogical reasoning is a part of the more general problem of generating la- 
tent relationships from relational data. Several approaches for this problem 
are discussed in Section 6. To the best of our knowledge, however, most ana- 
logical reasoning applications are interesting proofs of concept that tackle 
ambitious problems such as planning [Veloso and Carbonell (1993)], or are 
motivated as models of cognition [Gentner (1983)]. Our goal is to create an 
off-the-shelf method for practical exploratory data analysis. 

2. A review of probabilistic information retrieval and the Bayesian sets 
method. The goal of information retrieval is to provide data points (e.g., 
text documents, images, medical records) that are judged to be relevant to a 
particular query. Queries can be defined in a variety of ways and, in general, 
they do not specify exactly which records should be presented. In practice, 
retrieval methods rank data points according to some measure of similarity 
with respect to the query [Manning, Raghavan and Schütze (2008)]. 
Although queries can, in practice, consist of any piece of information, for the 



RANKING RELATIONS USING ANALOGIES 






purposes of this paper we will assume that queries are sets of objects of the 
same type we want to retrieve. 

Probabilities can be exploited as a measure of similarity. We will briefly 
review one standard probabilistic framework for information retrieval 
[Manning, Raghavan and Schütze (2008), Chapter 11]. Let R be a binary random 
variable representing whether an arbitrary data point X is "relevant" for a 
given query set S (R = 1) or not (R = 0). Let $P(\cdot \mid \cdot)$ be a generic probability 
mass function or density function, with its meaning given by the context. 
Points are ranked in decreasing order by the following criterion: 

$$\frac{P(R = 1 \mid X, S)}{P(R = 0 \mid X, S)} = \frac{P(R = 1 \mid S)}{P(R = 0 \mid S)} \, \frac{P(X \mid R = 1, S)}{P(X \mid R = 0, S)},$$

which is equivalent to ranking points by the expression 

(2.1) $\log P(X \mid R = 1, S) - \log P(X \mid R = 0, S).$
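
A brief sketch of why the two criteria give the same ordering, using only the definitions above: taking logarithms of the odds isolates a term that does not involve X.

```latex
\begin{align*}
\log \frac{P(R = 1 \mid X, S)}{P(R = 0 \mid X, S)}
  &= \underbrace{\log \frac{P(R = 1 \mid S)}{P(R = 0 \mid S)}}_{\text{same for every } X}
   + \log P(X \mid R = 1, S) - \log P(X \mid R = 0, S).
\end{align*}
```

Since the logarithm is monotone and the prior log-odds is constant across candidate points, sorting by (2.1) reproduces the ordering given by the posterior odds.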

The challenge is to define what form $P(X \mid R = r, S)$ should assume. It 
is not practical to collect labeled data in advance which, for every possible 
class of queries, will give an estimate for $P(R = 1 \mid X, S)$: in general, one 
cannot anticipate which classes of queries will exist. Instead, a variety of 
approaches have been developed in the literature in order to define a suitable 
instantiation of (2.1). These include a method that builds a classifier 
on-the-fly using S as elements of the positive class R = 1, and a random subset of 
data points as the negative class R = 0 [e.g., Turney (2008b)]. 

The Bayesian sets method of Ghahramani and Heller (2005) is a state- 
of-the-art probabilistic method for ranking objects, partially inspired by 
Bayesian psychological models of generalization in human cognition [Tenen- 
baum and Griffiths (2001)]. In this setup the event "R = 1" is equated with 
the event that X and the elements of S are i.i.d. points generated by the 
same model. The event "R = 0" is the event by which X and S are generated 
by two independent models: one for X and another for S. The parameters 
of all models are random variables that have been integrated out, with fixed 
(and common) hyperparameters. The result is the instantiation of (2.1) as 

(2.2) $\log P(X \mid S) - \log P(X) = \log \frac{P(X, S)}{P(X)\, P(S)},$

the Bayesian sets score function by which we rank points X given a query 
S. The right-hand side was rearranged to provide a more intuitive graphical 
model, shown in Figure 1. From this graphical model interpretation we can 
see that the score function is a Bayes factor comparing two models [Kass 
and Raftery (1995)]. 
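
As an illustration of the score in (2.2), the following is a minimal sketch for binary feature vectors, assuming independent Bernoulli features with Beta priors so that both predictive terms have closed forms. The function name `bayesian_sets_scores`, the toy data and the hyperparameter values are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def bayesian_sets_scores(X, query_idx, alpha, beta):
    """log P(x | S) - log P(x) for every row x of a binary matrix X.

    Assumes independent Bernoulli features with Beta(alpha_j, beta_j)
    priors, so the parameters can be integrated out analytically.
    """
    X = np.asarray(X, dtype=float)
    S = X[query_idx]
    n = S.shape[0]
    alpha_post = alpha + S.sum(axis=0)        # ones observed in the query set
    beta_post = beta + n - S.sum(axis=0)      # zeros observed in the query set
    const = np.sum(np.log(alpha + beta) - np.log(alpha + beta + n))
    w_on = np.log(alpha_post) - np.log(alpha)   # contribution when x_j = 1
    w_off = np.log(beta_post) - np.log(beta)    # contribution when x_j = 0
    return const + X @ w_on + (1.0 - X) @ w_off

# Toy data: rows 0 and 1 form the query; row 3 is their opposite.
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1]])
alpha = np.full(4, 0.5)   # illustrative hyperparameters
beta = np.full(4, 0.5)
scores = bayesian_sets_scores(X, query_idx=[0, 1], alpha=alpha, beta=beta)
ranking = np.argsort(-scores)   # best-matching rows first
```

Because the score is linear in x under this model, a whole database can be ranked with a single matrix-vector product.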

In the next section we describe how the Bayesian sets method can be 
adapted to define analogical similarity in the biological and information 
networks settings we consider, and why such modifications are necessary. 

3. A model of Bayesian analogical similarity for relations. To define an 
analogy is to define a measure of similarity between structures of related 
objects. In our setting, we need to measure the similarity between pairs of 
objects. The key aspect that distinguishes our approach from others is that 
we focus on the similarity between functions that map pairs to links, rather 
than focusing on the similarity between the features of objects in a candidate 
pair and the features of objects in the query pairs. 

As an illustration, consider an analogical reasoning question from a SAT- 
like exam where for a given pair (say, water : river) we have to choose, out 
of 5 pairs, the one that best matches the type of relation implicit in such a 
"query." In this case, it is reasonable to say car : highway would be a better 
match than (the somewhat nonsensical) soda : ocean , since cars flow on a 
highway, and so does water in a river. Notice that if we were to measure the 
similarity between objects instead of relations, soda : ocean would be a much 
closer pair, since soda is similar to water, and ocean is similar to river. 

Nevertheless, it is legitimate to infer relational similarity from individual 
object features, as summarized by Gentner and Medina (1998) in their "kind 
world hypothesis." What is needed is a mechanism by which object features 
should be weighted in a particular relational similarity problem. We postu- 
late that, in analogical reasoning, similarity between features of objects is 
only meaningful to the extent by which such features are useful to predict 
the existence of the relationships. 

Our approach can be described as follows. Let $\mathcal{A}$ and $\mathcal{B}$ represent 
object spaces. To say that an interaction A : B is analogous to 
$S = \{A^{(1)}:B^{(1)}, A^{(2)}:B^{(2)}, \ldots, A^{(N)}:B^{(N)}\}$ amounts to implicitly defining a measure of 
similarity between the pair A : B and the set of pairs S, where each query item 
$A^{(k)}:B^{(k)}$ corresponds to some pair $A^i : B^j$. However, this similarity is not 
directly derived from the similarity of the information contained in the 
distribution of objects themselves, $\{A^i\} \subset \mathcal{A}$, $\{B^j\} \subset \mathcal{B}$. Rather, the similarity 
between A : B and the set S is defined in terms of the similarity of the 
functions mapping the pairs as being linked. Each possible function captures a 
different possible relationship between the objects in the pair. 

Fig. 1. In order to score how well an arbitrary element X fits in with query set 
$S = \{X^1, X^2, \ldots, X^q\}$, the Bayesian sets methodology compares the marginal likelihood 
of the model in (a), P(X, S), against the model in (b), P(X)P(S). In (a), the random 
parameter vector $\Theta$ is given a prior defined by the (fixed) hyperparameter $\alpha$. The same 
(latent) parameter vector is shared by the query set and the new point. In (b), the parameter 
vector that generates X is different from the one that generates the query set. 

Bayesian analogical reasoning formulation. Consider a space 
of latent functions in $\mathcal{A} \times \mathcal{B} \to \{0, 1\}$. Assume that A and B are two objects 
classified as linked by some unknown function f(A, B), that is, f(A, B) = 1. 
We want to quantify how similar the function f(A, B) is to the function 
$g(\cdot, \cdot)$, which classifies all pairs $(A^i, B^j) \in S$ as being linked, that is, where 
$g(A^i, B^j) = 1$. The similarity should depend on the observations {S, A, B} 
and our prior distribution over $f(\cdot, \cdot)$ and $g(\cdot, \cdot)$. 

Functions $f(\cdot)$ and $g(\cdot)$ are unobserved, hence the need for a prior that 
will be used to integrate over the function space. Our similarity metric will 
be defined using Bayes factors, as explained next. 

3.1. Analogy in function spaces via logistic regression. For simplicity, 
we will consider a family of latent functions that is parameterized by a 
finite-dimensional vector: the logistic regression function with multivariate 
Gaussian priors for its parameters. 

For a particular pair $(A^i \in \mathcal{A}, B^j \in \mathcal{B})$, let 
$X^{ij} = [\Phi_1(A^i, B^j)\ \Phi_2(A^i, B^j)\ \cdots\ \Phi_K(A^i, B^j)]^T$ be a point on a feature space 
defined by the mapping $\Phi : \mathcal{A} \times \mathcal{B} \to \mathbb{R}^K$. This feature space mapping computes a 
K-dimensional vector of attributes of the pair that may be potentially relevant to predicting 
the relation between the objects in the pair. Let $L^{ij} \in \{0, 1\}$ be an indicator 
of the existence of a link or relation between $A^i$ and $B^j$ in the database. Let 
$\Theta = [\theta_1, \ldots, \theta_K]^T$ be the parameter vector for our logistic regression model 
such that 

(3.1) $P(L^{ij} = 1 \mid X^{ij}, \Theta) = \operatorname{logistic}(\Theta^T X^{ij}),$

where $\operatorname{logistic}(x) = (1 + e^{-x})^{-1}$. 
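
For concreteness, equation (3.1) can be evaluated as in the short sketch below. The helper names `logistic` and `link_probability`, and the numerical values, are illustrative assumptions; the paper's own implementation is in Java.

```python
import numpy as np

def logistic(x):
    """logistic(x) = 1 / (1 + exp(-x)), arranged so that exp never overflows."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))                    # always in (0, 1]
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def link_probability(x_pair, theta):
    """P(L = 1 | X, Theta) for a K-dimensional pair feature vector X."""
    return logistic(np.dot(theta, x_pair))

theta = np.array([1.0, -1.0, 0.5])    # hypothetical parameter vector
x_pair = np.array([2.0, 1.0, 0.0])    # hypothetical features of one pair
p_link = float(link_probability(x_pair, theta))
```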

We now apply the same score function underlying the Bayesian sets 
methodology explained in Section 2. However, instead of comparing objects 
by marginalizing over the parameters of their feature distributions, we com- 
pare functions for link indicators by marginalizing over the parameters of 
the functions. 

Let $L^S$ be the vector of link indicators for S: in fact, each $L \in L^S$ has the 
value L = 1, indicating that every pair of objects in S is linked. Consider 
the following Bayes factor: 

(3.2) $\frac{P(L^{ij} = 1, L^S = 1 \mid X^{ij}, S)}{P(L^{ij} = 1 \mid X^{ij})\, P(L^S = 1 \mid S)}.$

This is an adaptation of equation (2.2) where relevance is defined now by 
whether $L^{ij}$ and $L^S$ were generated by the same model, for fixed $\{X^{ij}, S\}$. 
In one sense, this is a discriminative Bayesian sets model, where we predict 
links instead of modeling joint object features. Since we are integrating 
out $\Theta$, a prior for this parameter vector is needed. The graphical models 
corresponding to this Bayes factor are illustrated in Figure 2. 

Fig. 2. The score of a new data point $\{A^i, B^j\}$ is given by the Bayes factor that compares 
models (a) and (b). Node $\alpha$ represents the hyperparameters for $\Theta$. In (a), the generative 
model is the same for both the new point and the query set represented in the rectangle. 
Notice that our conditioning set S of pairs might contain repeated instances of the same point, 
that is, some A or B might appear multiple times in different relations, as illustrated by 
nodes with multiple outgoing edges. In (b), the new point and the query set do not share 
the same parameters. 

Thus, each pair $(A^i, B^j)$ is evaluated with respect to a query set S by the 
score function given in (3.2), rewritten after taking a logarithm and dropping 
constants as 

(3.3) $\operatorname{score}(A^i, B^j) = \log P(L^{ij} = 1 \mid X^{ij}, S, L^S = 1) - \log P(L^{ij} = 1 \mid X^{ij}).$
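
One way to see how (3.3) follows from (3.2) is sketched below. It assumes, as in model (a) of Figure 2, that the features $X^{ij}$ of the candidate pair carry no information about $L^S$ once S is given.

```latex
\begin{align*}
\log \frac{P(L^{ij} = 1, L^S = 1 \mid X^{ij}, S)}
          {P(L^{ij} = 1 \mid X^{ij})\, P(L^S = 1 \mid S)}
 &= \log \frac{P(L^{ij} = 1 \mid X^{ij}, S, L^S = 1)\, P(L^S = 1 \mid S)}
              {P(L^{ij} = 1 \mid X^{ij})\, P(L^S = 1 \mid S)} \\
 &= \log P(L^{ij} = 1 \mid X^{ij}, S, L^S = 1) - \log P(L^{ij} = 1 \mid X^{ij}).
\end{align*}
```

The factor $P(L^S = 1 \mid S)$ is shared by every candidate pair, so it has no effect on the ranking.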

The exact details of our procedure are as follows. We are given a relational 
database $(\mathcal{D}_A, \mathcal{D}_B, \mathcal{L}_{AB})$. Dataset $\mathcal{D}_A$ ($\mathcal{D}_B$) is a sample of objects of type $\mathcal{A}$ 
($\mathcal{B}$). Relationship table $\mathcal{L}_{AB}$ is a binary matrix modeled as generated from 
a logistic regression model of link existence. A query proceeds according to 
the following steps: 

1. the user selects a set of pairs S that are linked in the database, where 
the pairs in S are assumed to have some relation of interest; 

2. the system performs Bayesian inference to obtain the corresponding 
posterior distribution for $\Theta$, $P(\Theta \mid S, L^S)$, given a Gaussian prior $P(\Theta)$; 

3. the system iterates through all linked pairs, computing the following for 
each pair: 

$$P(L^{ij} = 1 \mid X^{ij}, S, L^S = 1) = \int P(L^{ij} = 1 \mid X^{ij}, \Theta)\, P(\Theta \mid S, L^S = 1)\, d\Theta.$$

$P(L^{ij} = 1 \mid X^{ij})$ is similarly computed by integrating over $P(\Theta)$. All pairs 
are presented in decreasing order according to the score in equation (3.3). 

The integral presented above does not have a closed formula. Because 
computing the integrals by a Monte Carlo method for a large number of 
pairs would be unreasonable, we use a variational approximation [Jordan et 
al. (1999); Airoldi (2007)]. Figure 3 presents a summary of the approach. 

Fig. 3. General framework of the procedure: first, a "prior" over parameters $\Theta$ for a 
link classifier is defined empirically using linked and unlinked pairs of points (the dashed 
edges indicate that creating a prior empirically is optional, but in practice we rely on this 
method). Given a query set S of linked pairs of interest, the system computes the 
predictive likelihood of each linked pair $\mathcal{D}^{(i)} \in \mathcal{D}^+$ and compares it to the conditional predictive 
likelihood, given the query. This defines a measure of similarity with respect to S by which 
all pairs in $\mathcal{D}^+$ are sorted. 

The suggested setup scales as $O(K^3)$ with the feature space dimension, 
due to the matrix inversions necessary for (variational) Bayesian logistic 
regression [Jaakkola and Jordan (2000)]. A less precise approximation to 
$P(\Theta \mid S, L^S)$ can be imposed if the dimensionality of $\Theta$ is too high. However, 
it is important to point out that once the initial integral $P(\Theta \mid S, L^S)$ is 
approximated, each score function can be computed at a cost of $O(K^2)$. 
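
The query procedure can be illustrated with the rough sketch below. It is not the variational scheme used in the paper: as a simpler stand-in, it substitutes a Laplace (Gaussian) approximation to the posterior of $\Theta$ and a standard probit-style approximation to the logistic-Gaussian integral. All function names and the toy data are assumptions made for this example.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def laplace_posterior(X_s, m0, V0, n_iter=50):
    """Gaussian approximation to P(Theta | S, L^S = 1); every query label is 1."""
    P0 = np.linalg.inv(V0)                                  # prior precision
    theta = m0.copy()
    for _ in range(n_iter):
        p = logistic(X_s @ theta)
        grad = X_s.T @ (1.0 - p) - P0 @ (theta - m0)        # gradient of log posterior
        H = (X_s * (p * (1.0 - p))[:, None]).T @ X_s + P0   # negative Hessian
        theta = theta + np.linalg.solve(H, grad)            # Newton step, O(K^3)
    p = logistic(X_s @ theta)
    H = (X_s * (p * (1.0 - p))[:, None]).T @ X_s + P0
    return theta, np.linalg.inv(H)

def predictive(X, m, V):
    """Approximate integral of logistic(x^T Theta) under Theta ~ N(m, V)."""
    mu = X @ m
    var = np.einsum("ij,jk,ik->i", X, V, X)                 # x^T V x, O(K^2) per pair
    return logistic(mu / np.sqrt(1.0 + np.pi * var / 8.0))

def rbsets_scores(X_cand, X_s, m0, V0):
    """Score (3.3): log predictive under the posterior minus under the prior."""
    m, V = laplace_posterior(X_s, m0, V0)
    return np.log(predictive(X_cand, m, V)) - np.log(predictive(X_cand, m0, V0))

# Toy example with K = 3 (the last feature acts as an intercept).
m0, V0 = np.zeros(3), np.eye(3)
X_s = np.array([[1.0, 0.0, 1.0], [0.9, 0.1, 1.0]])      # query pairs
X_cand = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])   # candidate linked pairs
scores = rbsets_scores(X_cand, X_s, m0, V0)
order = np.argsort(-scores)                              # most analogous first
```

The posterior is computed once per query, after which each candidate needs only a quadratic form in K, which mirrors the cost structure described above.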

Our analogical reasoning formulation is a relational model in that it mod- 
els the presence and absence of interactions between objects. By conditioning 
on the link indicators, the similarity score between A : B and C : D is always 
a function of pairs (A, B) and (C, D) that is not in general decomposable as 
similarities between A and C, and B and D. 






SILVA, HELLER, GHAHRAMANI AND AIROLDI 



3.2. Comparison with Bayesian sets and stochastic block models. The 
model presented in Figure 2 is a conditional independence model for 
relationship indicators, that is, given object features and parameters, the entries 
of $\mathcal{L}_{AB}$ are independent. However, the entries in $\mathcal{L}_{AB}$ are in general marginally 
dependent. Since this is a model of relationships given object attributes, we 
call the model introduced here the relational Bayesian sets model. 

Our approach has some similarity to the so-called stochastic block models. 
These models were developed four decades ago in the network literature to 
quantify the notion of "structural equivalence" by means of blocks of nodes 
that instantiate similar connectivity patterns [Lorrain and White (1971); 
Holland and Leinhardt (1975)]. Modern stochastic block model approaches, 
in statistics and machine learning, build on these seminal works by intro- 
ducing the discovery of the block structure as part of the model search strat- 
egy [Fienberg, Meyer and Wasserman (1985); Nowicki and Snijders (2001); 
Kemp et al. (2006); Xu et al. (2006); Airoldi et al. (2005, 2008); Hoff (2008)]. 
The observed features in our approach, $X^{ij}$, effectively play the same role 
as the latent indicators in stochastic block models.$^3$ Since $X^{ij}$ is observed, 
there is no need to integrate over the feature space to obtain the posterior 
distribution of $\Theta$. This computational efficiency is particularly relevant in 
information retrieval and exploratory data analysis, where users expect a 
relatively short response time. 

As an alternative to our relational Bayesian sets approach, consider the 
following direct modification of the standard Bayesian sets formulation to 
this problem: merge the data sets $\mathcal{D}_A$ and $\mathcal{D}_B$ into a single data set, 
creating for each pair $(A^i, B^j)$ a row in the database with an extra binary 
indicator of relationship existence. Create a joint model for pairs by using 
the marginal models for A and B and treating different rows as being 
independent. This ignores the fact that the resulting merged data points are 
not really i.i.d. under such a model, because the same object might appear 
in multiple relations [Dzeroski and Lavrac (2001)]. The model also fails to 
capture the dependency between $A^i$ and $B^j$ that arises from conditioning 
on $L^{ij}$, even if $A^i$ and $B^j$ are marginally independent. Nevertheless, 
heuristically this approach can sometimes produce good results, and for several 
types of probability families it is very computationally efficient. We evaluate 
it in Section 4. 

3.3. Choice of features and relational discrimination. Our setup assumes 
that the feature space $\Phi$ provides a reasonable classifier to predict the 
existence of links. Useful predictive features can also be generated 
automatically with a variety of algorithms [e.g., the "structural logistic regression" 
of Popescul and Ungar (2003)]. See also Dzeroski and Lavrac (2001). Jensen 
and Neville (2002) discuss shortcomings of methods for automated feature 
selection in relational classification. 

$^3$ In a stochastic block model, typically each object has a single feature $\eta$ indicating 
membership to some latent class. For a pair $A^i, B^j$, the corresponding feature vector $X^{ij}$ 
would be $(\eta_A, \eta_B)$. 

We also assume feature spaces are the same for all possible combinations 
of objects. This allows for comparisons between, for example, cells from dif- 
ferent species, or web pages from different web domains, as long as features 
are generated by the same function $\Phi(\cdot, \cdot)$. In general, we would like to relax 
this requirement, but for the problem to be well-defined, features from the 
different spaces must be related somehow. A hierarchical Bayesian formu- 
lation for linking different feature spaces is one possibility which might be 
treated in a future work. 

3.4. Priors. The choice of prior is based on the observed data, in a way 
that is equivalent to the choice of priors used in the original formulation of 
Bayesian sets [Ghahramani and Heller (2005)]. Let $\hat{\Theta}$ be the maximum 
likelihood estimator of $\Theta$ using the relational database $(\mathcal{D}_A, \mathcal{D}_B, \mathcal{L}_{AB})$. Since the 
number of possible pairs grows at a quadratic rate with the number of 
objects, we do not use the whole database for maximum likelihood estimation. 
Instead, to get $\hat{\Theta}$, we use all linked pairs as members of the "positive" class 
(L = 1), and subsample unlinked pairs as members of the "negative" class 
(L = 0). We subsample by sampling each object uniformly at random from 
the respective data sets $\mathcal{D}_A$ and $\mathcal{D}_B$ to get a new pair. Since link matrices 
$\mathcal{L}_{AB}$ are usually very sparse, in practice, this will almost always provide an 
unlinked pair. Sections 4 and 5 provide more details. 
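
The subsampling of "negative" pairs can be sketched as follows. The function name `sample_unlinked_pairs` and the toy link matrix are hypothetical, and the explicit check that a sampled pair is unlinked is an addition for this example: the text relies on sparsity to make linked draws rare. The sketch assumes the matrix contains at least one unlinked pair.

```python
import numpy as np

def sample_unlinked_pairs(link_matrix, n_samples, rng):
    """Draw (i, j) index pairs uniformly at random, keeping only unlinked ones."""
    n_a, n_b = link_matrix.shape
    pairs = []
    while len(pairs) < n_samples:
        i = rng.integers(n_a)          # object sampled uniformly from the A data set
        j = rng.integers(n_b)          # object sampled uniformly from the B data set
        if link_matrix[i, j] == 0:     # rejection is rare when the matrix is sparse
            pairs.append((i, j))
    return pairs

rng = np.random.default_rng(0)
link = np.zeros((50, 60), dtype=int)   # toy sparse link matrix
link[0, 1] = link[3, 4] = 1
negatives = sample_unlinked_pairs(link, 20, rng)
```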

We use the prior $P(\Theta) = \mathcal{N}(\hat{\Theta}, (cT)^{-1})$, where $\mathcal{N}(m, V)$ is a normal of 
mean m and variance V. Matrix T is the empirical second moments matrix 
of the linked object features, although a different choice might be adequate 
for different applications. Constant c is a smoothing parameter set by the 
user. In all of our experiments we set c to be equal to the number of positive 
pairs. A good choice of c might be important to obtain maximum perfor- 
mance, but we leave this issue as future work. Wang et al. (2009) present 
some sensitivity analysis results for a particular application in text analysis. 
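
A minimal sketch of how such a prior could be assembled is given below. It assumes that the second moments matrix is the average outer product of the linked pairs' feature vectors and that this matrix is invertible; the function name `empirical_prior` and the random toy features are illustrative only.

```python
import numpy as np

def empirical_prior(theta_hat, X_linked, c):
    """Mean and covariance of the Gaussian prior N(theta_hat, (c T)^{-1})."""
    n = X_linked.shape[0]
    T = X_linked.T @ X_linked / n      # empirical second moments (assumed normalization)
    cov = np.linalg.inv(c * T)         # requires T to be nonsingular
    return theta_hat, cov

rng = np.random.default_rng(1)
X_linked = rng.normal(size=(10, 3))    # toy features of 10 linked pairs, K = 3
theta_hat = np.zeros(3)                # stand-in for the maximum likelihood estimate
c = X_linked.shape[0]                  # c set to the number of positive pairs
prior_mean, prior_cov = empirical_prior(theta_hat, X_linked, c)
```

With c equal to the number of positive pairs and this normalization, the prior precision reduces to the sum of outer products of the linked feature vectors.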

Empirical priors are a sensible choice, since this is a retrieval, not a predic- 
tive, task. Basically, the entire data set is the population, from which prior 
information is obtained on possible query sets. A data-dependent prior based 
on the population is important for an approach such as Bayesian sets, since 
deviances from the "average" behavior in the data are useful to discriminate 
between subpopulations. 

3.5. On continuous and multivariate relations. Although we focus on 
measuring similarity of qualitative relationships, the same idea could be ex- 
tended to continuous (or ordinal) measures of relationship, or relationships 
where each $L^{ij}$ is a vector. For instance, Turney and Littman (2005) 
measure relations between words by their co-occurrences on the neighborhood 
of specific keywords, such as the frequency of two words being connected by 
a specific preposition in a large body of text documents. Several similarity 
metrics can be defined on this vector of continuous relationships. However, 
given data on word features, one can easily modify our approach by sub- 
stituting the logistic regression component with some multiple regression 
model. 

4. Ranking hyperlinks on the web. In the following application we con- 
sider a collection of web pages from several universities: the WebKB col- 
lection, where relations are given by hyperlinks [Craven et al. (1998)]. Web 
pages are classified as being of type course, department, faculty, project, 
staff, student or other. Documents come from four universities (Cornell, 
Texas, Washington and Wisconsin). We are interested in recovering pairs 
of web pages $\{A, B\}$ where web page A has a link to web page B. Notice 
that the relationship is asymmetric. Different types of web pages imply dif- 
ferent types of links. For instance, a faculty web page linking to a project 
web page constitutes a type of link. The analogical reasoning task here is 
simplified if we assume each web page object has a single role (i.e., exactly 
one out of the pre-defined types {course, department, faculty, project, staff, 
student, other}), and therefore a pair of web pages implies a unique type 
of relationship. The web page types are for evaluation purposes only, as we 
explain later: we will not provide this information to the model. 

Our main standard of comparison is a "flattened Bayesian sets" 
algorithm (which we will call "standard Bayesian sets," SBSets, in contrast to 
the relational model, RBSets). Using a multivariate independent Bernoulli 
model as in the original paper [Ghahramani and Heller (2005)], we merge 
linked web page pairs into single rows, and then apply the original algorithm 
directly to the merged data. It is clear that data points are not independent 
anymore, but the SBSets algorithm assumes this is the case. Evaluating 
this algorithm serves the purpose of both measuring the loss of not treating 
relational data as such, as well as the limitations of evaluating the similarity 
of pairs through models for the marginal probabilities of A and B instead 
of models for the predictive function $P(L^{ij} \mid X^{ij})$. 

Binary data was extracted from this database using the same methodol- 
ogy as in Ghahramani and Heller (2005). A total of 19,450 binary variables 
per object are generated, where each variable indicates whether a word from 
a fixed dictionary appears in a given document more frequently than the av- 
erage. To avoid introducing extra approximations into RBSets, we reduced 
the dimensionality of the original representation using singular value decom- 
position, obtaining 25 measures per object. 
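
The preprocessing just described might look like the sketch below. Two details are assumptions rather than facts from the text: "the average" is taken to be each word's mean count across documents, and the SVD is applied to the uncentered binary matrix. The function names and the random toy counts are hypothetical.

```python
import numpy as np

def binarize_above_average(counts):
    """1 where a word occurs in a document more often than that word's mean count."""
    return (counts > counts.mean(axis=0, keepdims=True)).astype(float)

def reduce_dimension(B, n_components):
    """Coordinates of each row of B along the top singular directions."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

rng = np.random.default_rng(2)
counts = rng.poisson(1.0, size=(30, 200))   # toy document-by-word count matrix
B = binarize_above_average(counts)
Z = reduce_dimension(B, 25)                 # 25 measures per object
```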

In this experiment objects are of the same type and, therefore, of the same
dimensionality. The feature vector $X^{ij}$ for each pair of objects
$\{A^i, B^j\}$ consists of the V features for object $A^i$, the V features
of object $B^j$, and measures $Z = \{Z_1, \ldots, Z_V\}$, where
$Z_v = (A^i_v \times B^j_v)/(\|A^i\| \times \|B^j\|)$, $\|A^i\|$ being
the Euclidean norm of the V-dimensional representation of $A^i$. We also add
a constant value (1) to the feature set as an intercept term for the logistic
regression. Feature set Z is exactly the one used in the cosine distance
measure (see footnote 4), a common and practical measure widely used in
information retrieval [Manning, Raghavan and Schütze (2008)]. This feature
space also has the important advantage of scaling well (linearly) with the
number of variables in the database. Moreover, adopting such features makes
our comparisons fairer, since we also evaluate how well cosine distance
itself performs in our task. Notice that our choice of $X^{ij}$ is suitable
for asymmetric relationships, as naturally occur in the domain of web page
links. For symmetric relationships, features such as $|A^i_v - B^j_v|$
could be used instead.
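
The construction of the pair features can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `pair_features` and the handling of all-zero objects are assumptions made here.

```python
import math


def pair_features(a, b):
    """Build the feature vector for an ordered pair (A, B).

    a, b: equal-length lists of floats (the V-dimensional object features).
    Returns a + b + Z + [1.0], where Z_v = a_v * b_v / (||a|| * ||b||)
    and the trailing 1.0 is the intercept term.
    """
    if len(a) != len(b):
        raise ValueError("objects must share the same dimensionality")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    denom = norm_a * norm_b
    if denom == 0.0:
        # Assumption: an all-zero object contributes zero interaction terms.
        z = [0.0] * len(a)
    else:
        z = [x * y / denom for x, y in zip(a, b)]
    return list(a) + list(b) + z + [1.0]
```

Summing the Z block of the returned vector gives the cosine similarity of the two objects, and swapping the arguments changes the vector, which is what makes the representation suitable for directed links.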

In order to set the empirical prior, we sample 10 "negative" pairs for 
each "positive" one, and weight them to reflect the proportion of linked to 
unlinked pairs in the database. That is, in the WebKB study we use 10 
negatives for each positive, and we count each negative case as being 350 
cases replicated. We perform subsampling and reweighting in order to be 
able to fit the database in the memory of a desktop computer. 
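
A minimal sketch of the subsampling and reweighting step is given below. The function name, the fixed seed and the rule "weight = number of unlinked pairs divided by number of sampled negatives" are assumptions made for illustration; the text only states the ratio of 10 negatives per positive and the resulting weight of 350.

```python
import random


def subsample_negatives(positives, negatives, per_positive=10, seed=0):
    """Sample negatives and return (sampled_negatives, weight).

    Each sampled negative is meant to count as `weight` replicated cases,
    so that the weighted sample reflects the full unlinked population.
    """
    k = min(len(negatives), per_positive * len(positives))
    if k == 0:
        raise ValueError("need at least one positive and one negative pair")
    rng = random.Random(seed)
    sampled = rng.sample(negatives, k)  # sampling without replacement
    weight = len(negatives) / k
    return sampled, weight
```

With 2 positives and 7000 unlinked pairs, for example, this draws 20 negatives, each counted as 350 cases.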

Evaluation of the significance of retrieved items often relies on subjective 
assessments [Ghahramani and Heller (2005)]. To simplify our study, we will 
focus on particular setups where objective measures of success are defined. 

To evaluate the gain of our model over competitors, we will use the following
setup. In the first query, we are given all pairs of web pages of the type
student → course from three of the labeled universities, and evaluate how
relations are ranked in the fourth university. Because we know class labels
for the web pages (while the algorithm does not), we can use the classes of
the returned pairs to label a hit as being "relevant" or "irrelevant." We label
a pair $(A^i, B^j)$ as relevant if and only if $A^i$ is of type student,
$B^j$ is of type course, and $A^i$ links to $B^j$.

This is a very stringent criterion, since other types of relations could also
be valid (e.g., staff → course appears to be a reasonable match). However,
this facilitates objective comparisons of algorithms. Also, the other class
contains many types of pages, which allows for possibilities such as a
student → "hobby" pair. Such pairs might be hard to evaluate (e.g., is that
particular hobby demanding or challenging in a similar way to coursework?).
As a compromise, we omit all pages from the category other in order to better
clarify differences between algorithms (see footnote 5).

4 The cosine similarity measure between two items corresponds to the sum of
the features in Z.

Precision/recall curves [Manning, Raghavan and Schütze (2008)] for the
student → course queries are shown in Figure 4. There are four queries,
each corresponding to a search over a specific university given all valid
student → course pairs from the other three. There are five algorithms in
each evaluation: the standard Bayesian sets with the original 19,450 binary
variables for each object, plus another 19,450 binary variables, each
corresponding to the product of the respective variables in the original pair
of objects (SBSets1); the standard Bayesian sets with the original binary
variables only (SBSets2); a standard cosine distance measure over the
25-dimensional representation for each page, with pairs being given by the
combined vector of 50 features (Cosine 1); a cosine distance measure using
the raw 19,450-dimensional binary representation of each document
(Cosine 2); and our approach, RBSets.



[Figure 4: four precision/recall panels, one per university, titled
"student -> course (cornell)", "student -> course (texas)",
"student -> course (washington)" and "student -> course (wisconsin)";
horizontal axis: recall; curves: RBSets, SBSets1, SBSets2, Cosine1, Cosine2.]

Fig. 4. Results for student → course relationships.



5 As an extreme example, querying student → course pairs from the University
of Wisconsin returned student → other pairs in the top four positions.
However, these other pages were for some reason course pages, such as
http://www.cs.wisc.edu/~markhill/cs752.html.

In Figure 4, RBSets shows precision/recall curves that are consistently
superior or equal to those of the other methods. Although SBSets performs
well when asked to retrieve only student items or only course items, it
falls short of detecting which features of student and course are relevant
to predicting a link. The discriminative model within RBSets conveys this
information through the link parameters.

We also ran an experiment with a query of type faculty → project, shown
in Figure 5. This time the results of the algorithms were closer to each
other. To make differences more evident, we adopt a slightly different
measure of success: we count a full hit (1) if the pair retrieved is a
faculty → project pair, and half a hit (0.5) for pairs of type
student → project and staff → project. Notice this is a much harder query.
For instance, the structure of the project web pages in the texas group was
quite distinct from that of the other universities: they are mostly very
short, basically containing links to members of the project and to other
project web pages.

Although the precision/recall curves convey a global picture of the
performance of each algorithm, they might not be a completely clear way of
ranking approaches for cases where curves intersect at several points. In
order to summarize algorithm performances with a single statistic, we
computed the area under each precision/recall curve (with linear
interpolation between points). Results are given in Table 1. An asterisk
marks the largest area under the curve for each university and query. The
dominance of RBSets should be clear.

Fig. 5. Results for faculty → project relationships.

Table 1
Area under the precision/recall curve for each algorithm and query
(C1, C2: Cosine 1 and 2; RB: RBSets; SB1, SB2: SBSets1 and 2)

                     Student → course                    Faculty → project
              C1     C2     RB     SB1    SB2     C1     C2     RB     SB1    SB2
Cornell       0.87*  0.82   0.87*  0.82   0.80    0.19   0.18   0.24*  0.18   0.18
Texas         0.62   0.32   0.77*  0.55   0.54    0.24   0.21   0.29*  0.12   0.12
Washington    0.69   0.31   0.76*  0.67   0.64    0.40   0.42   0.47*  0.40   0.40
Wisconsin     0.77   0.72   0.88*  0.75   0.73    0.28   0.30*  0.26   0.19   0.21
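
The area computation can be sketched as follows. This is an illustrative version only: the function names are hypothetical, precision is recorded at each relevant item, and the curve is anchored at recall 0 with the precision of the first relevant item, which is one common convention and an assumption here rather than a detail given in the text.

```python
def precision_recall_points(ranked_relevance):
    """Return (recall, precision) points for a ranked list.

    ranked_relevance: list of booleans, best-ranked item first.
    """
    total = sum(1 for rel in ranked_relevance if rel)
    if total == 0:
        raise ValueError("no relevant items in the ranking")
    points, hits = [], 0
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            points.append((hits / total, hits / k))
    # Assumption: extend the curve back to recall 0 at the first precision.
    return [(0.0, points[0][1])] + points


def pr_auc(points):
    """Trapezoidal area, i.e. linear interpolation between points."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area
```

Under these conventions a ranking that places every relevant pair first has area 1, and each irrelevant pair ranked above a relevant one lowers the area.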

5. Ranking protein interactions. The budding yeast is a unicellular
organism that has become a de facto model organism for the study of
molecular and cellular biology [Botstein, Chervitz and Cherry (1997)]. There
are about 6000 proteins in the budding yeast, which interact in a number of
ways [Cherry et al. (1997)]. For instance, proteins bind together to form
protein complexes, the physical units that carry out most functions in the
cell [Krogan et al. (2006)]. In recent years, significant resources have been
directed to collecting experimental evidence of physical protein binding, in
an effort to infer and catalogue protein complexes and their multifaceted
functional roles [e.g., Fields and Song (1989); Ito et al. (2000); Uetz et
al. (2000); Gavin et al. (2002); Ho et al. (2002)]. Currently, there are four
main sources of interactions between pairs of proteins that target proteins
localized in different cellular compartments with variable degrees of
success: (i) literature curated interactions [Reguly et al. (2006)],
(ii) yeast two-hybrid (Y2H) interaction assays [Yu et al. (2008)],
(iii) protein fragment complementation (PCA) interaction assays [Tarassov et
al. (2008)], and (iv) tandem affinity purification (TAP) interaction assays
[Gavin et al. (2006); Krogan et al. (2006)]. These collections include a
total of about 12,292 protein interactions [Jensen and Bork (2008)], although
the number of such interactions is estimated to be between 18,000 [Yu et al.
(2008)] and 30,000 [von Mering et al. (2002)].

Statistical methods have been developed for analyzing many aspects of
this large protein interaction network, including de-noising [Bernard, Vaughn
and Hartemink (2007); Airoldi et al. (2008)], function prediction [Nabieva
et al. (2005)] and identification of binding motifs [Banks et al. (2008)].

5.1. Overview of the analysis. We consider multiple functional
categorization systems for the proteins in budding yeast. For evaluation
purposes, we use individual proteins' functional annotations curated by the
Munich Information Center for Protein Sequences [MIPS, Mewes et al. (2004)],
those by the Kyoto Encyclopedia of Genes and Genomes [KEGG, Kanehisa and Goto
(2000)] and those by the Gene Ontology consortium [GO, Ashburner et al.
(2000)]. We consider multiple collections of physical protein interactions that
encode alternative semantics. Physical protein-to-protein interactions in the
MIPS curated collection measure physical binding events observed
experimentally in Y2H and TAP experiments, whereas physical protein-to-protein
interactions in the KEGG curated collection measure a number of different
modes of interaction, including phosphorylation, methylation and physical
binding, all taking place in the context of a specific signaling pathway. We
therefore have three possible functional annotation databases (MIPS, KEGG and
GO) and two possible link matrices (MIPS and KEGG), which can be combined.

Our experimental pipeline is as follows: (i) Pick a database of functional
annotations, say, MIPS, and a collection of interactions, say, MIPS (again).
(ii) Pick a pair of categories, $M_1$ and $M_2$. For instance, take $M_1$ to
be cytoplasm (MIPS 40.03) and $M_2$ to be cytoplasmic and nuclear degradation
(MIPS 06.13.01). (iii) Sample, uniformly at random and without replacement,
a set S of 15 interactions in the chosen collection. (iv) Rank other
interacting pairs (see footnote 6) according to the score in equation (3.3)
and, for comparison purposes, according to three other approaches to be
described in Section 5.1.4. (v) Repeat the process for a large number of
pairs $M_1 \times M_2$, generating 5 different query sets S for each pair of
categories. (vi) Calculate an evaluation metric for each query and each of
the four scores, and report a comparative summary of the results.

5.1.1. Protein-specific features. The protein-specific features were
generated using the data sets summarized in Table 2 and an additional data
set [Qi, Bar-Joseph and Klein-Seetharaman (2006)]. Twenty gene expression
attributes were obtained from the data set processed by Qi, Bar-Joseph and
Klein-Seetharaman (2006). Each gene expression attribute for a protein pair
$P_i : P_j$ corresponds to the correlation coefficient between the expression
levels of the corresponding genes. The 20 different attributes are obtained
from 20 different experimental conditions as measured by microarrays. We did
not use pairs of proteins from Qi et al. for which we did not have data in
the data sets listed in Table 2. This resulted in approximately 6000
positively linked data points for the MIPS network and 39,000 for KEGG.

We generated another 25 protein-protein gene expression features from 
the data in Table 2 using the same procedure based on correlation coeffi- 
cients. This gives a total of 45 attributes, corresponding to the main data 
set used in our relational Bayesian sets runs. 
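
A correlation-based pair attribute of this kind can be sketched as follows. The sketch assumes the Pearson correlation coefficient and one expression profile per gene and experimental condition; the function names and the zero-variance convention are illustrative choices, not details taken from the data processing described above.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        # Assumption: a constant profile is treated as uncorrelated.
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def expression_features(profiles_i, profiles_j):
    """One correlation attribute per experimental condition.

    profiles_i[c], profiles_j[c]: expression levels of the two genes
    under condition c.
    """
    return [pearson(u, v) for u, v in zip(profiles_i, profiles_j)]
```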



6 The portion of the ranked list that is relevant for evaluation purposes is
limited to a subset of the protein-protein interactions. More details are
given in Section 5.1.3.

Table 2
Collection of data sets used to generate protein-specific features

No.  Measurements description             Data sources
1.   Expression microarrays               Gasch et al. (2000); Brem et al. (2005);
                                          Primig et al. (2000); Yvert et al. (2003)
2.   Synthetic genetic interactions       Breitkreutz, Stark and Tyers (2003); SGD
3.   Cellular localization                Huh et al. (2003)
4.   Transcription factor binding sites   Harbison et al. (2004); TRANSFAC
5.   Sequence similarities                Altschul et al. (1990); Zhu and Zhang (1999)



Another data set was generated using the remaining (i.e., nonmicroarray)
features of Table 2. Such features are binary and highly sparse, with most
entries being 0 for the majority of linked pairs. We removed attributes for
which we had fewer than 20 linked pairs with positive values according to
the MIPS network. The total number of extra binary attributes was 16.

Several measurements were missing. We imputed missing values for each 
variable in a particular data point by using its empirical average among the 
observed values. 

Given the 45 or 61 attributes of a given pair $\{P_i, P_j\}$, we applied a
nonlinear transformation in which we normalize the vector by its Euclidean
norm in order to obtain our feature table X.
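
The two preprocessing steps above, mean imputation followed by normalization, can be sketched as follows. Missing values are represented by `None`, which is an assumption of this sketch, as are the function names and the fallbacks for an unobserved column or an all-zero row.

```python
import math


def impute_column_means(rows):
    """Replace None entries by the mean of the observed values of that variable."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        # Assumption: a variable with no observed value is filled with 0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[c] if r[c] is None else r[c] for c in range(n_cols)]
            for r in rows]


def l2_normalize(row):
    """Divide a feature vector by its Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in row))
    # Assumption: an all-zero row is returned unchanged.
    return list(row) if norm == 0.0 else [v / norm for v in row]


def feature_table(rows):
    """Impute missing values, then normalize each pair's vector."""
    return [l2_normalize(r) for r in impute_column_means(rows)]
```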

5.1.2. Calibrating the prior for $\Theta$. We initially fit a logistic
regression classifier to our data by maximum likelihood estimation (MLE),
obtaining the estimate $\hat{\Theta}$. Our choice of covariance matrix
$\Sigma$ for $\Theta$ is defined to be a rescaling of a squared norm of the
data:

(5.1)    $(\Sigma)^{-1} = X_{\mathrm{pos}}^{\top} X_{\mathrm{pos}}$,

where Xpos is the matrix containing the protein-protein features only of 
the linked pairs used in the MLE computation. 

5.1.3. Evaluation metrics. As in the WebKB experiment, we propose an
objective measure of evaluation that is used to compare different algorithms.
Consider a query set S, and a ranked response list
$R = \{R^1, R^2, R^3, \ldots, R^N\}$ of protein-protein pairs. Every element
of S is a pair of proteins $P_i : P_j$ such that $P_i$ is of class $M_i$ and
$P_j$ is of class $M_j$, where $M_i$ and $M_j$ are classes from either MIPS,
KEGG or Gene Ontology. In general, proteins belong to multiple classes. This
is in contrast with the WebKB experiment, where, according to our web page
categorization, there was only one possible type of relationship for each
pair of web pages. The retrieval algorithm that generates R does not receive
any information concerning the MIPS, KEGG or GO taxonomy. R starts with the
linked protein pair that is judged most similar to S, followed by the other
protein pairs in the population, in decreasing order of similarity. Each
algorithm has its own measure of similarity.

The evaluation criterion for each algorithm is as follows: as before, we
generate a precision-recall curve and calculate the area under the curve
(AUC). We also calculate the proportion (TOP10), among the top 10 elements
in each ranking, of pairs that match the original $\{M_1, M_2\}$ selection
(i.e., a "correct" $P_i : P_j$ is one where $P_i$ is of class $M_1$ and
$P_j$ of class $M_2$, or vice versa; notice that each protein belongs to
multiple classes, so both conditions might be satisfied). Since a researcher
is only likely to look at the top ranked pairs, it makes sense to define a
measure that uses only a subset of the ranking. AUC and TOP10 are our two
evaluation measures.
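
The TOP10 measure can be sketched as follows. Each ranked pair is represented here by the two sets of class labels of its proteins; this representation, the function names, and the choice to divide by the number of pairs actually available when fewer than 10 are ranked are assumptions of the sketch.

```python
def is_correct(pair_classes, m1, m2):
    """True if one protein is of class m1 and the other of class m2.

    pair_classes: (classes_of_Pi, classes_of_Pj), each a set of labels,
    since a protein may belong to several classes.
    """
    classes_i, classes_j = pair_classes
    return (m1 in classes_i and m2 in classes_j) or \
           (m2 in classes_i and m1 in classes_j)


def top_k_proportion(ranked_pairs, m1, m2, k=10):
    """Proportion of correct pairs among the top k of a ranking."""
    top = ranked_pairs[:k]
    if not top:
        raise ValueError("empty ranking")
    return sum(1 for p in top if is_correct(p, m1, m2)) / len(top)
```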

The original classes $\{M_1, M_2\}$ are known to the experimenter but not
known to the algorithms. As in the WebKB experiment, our criterion is
rather stringent, in the sense that it requires a perfect match of each
$R^i$ with the MIPS, KEGG or GO categorization. There are several ways by
which a pair $R^i$ might be analogous to the relation implicit in S, and they
do not need to agree with MIPS, GO or KEGG. Still, if we are willing to
believe that these standard categorization systems capture the functional
organization of proteins at some level, this must lead to an association
between the categories given to S and relevant subpopulations of
protein-protein interactions similar to S. Therefore, the corresponding AUC
and TOP10 are useful tools for comparing different algorithms even if the
actual measures are likely to be pessimistic for a fixed algorithm.

5.1.4. Competing algorithms. We compare our method against a variant
of it and two similarity metrics widely used for information retrieval:

1. The cosine score [Manning, Raghavan and Schütze (2008)], denoted by
COS.

2. The nearest neighbor score, denoted by NNS.

3. The relational maximum likelihood sets score, denoted by MLS.

The nearest neighbor score measures the minimum Euclidean distance between
$R^i$ and any individual point in S, for a given query set S and a given
candidate point $R^i$. The relational maximum likelihood sets score is a
variation of RBSets where we initially sample a subset of the unlinked pairs
(10,000 points in our setup) and, for each query S, we fit a logistic
regression model to obtain the parameter estimate
$\hat{\Theta}_S^{\mathrm{MLE}}$. We also use a logistic regression model fit
to the whole data set (the same one used to generate the prior for RBSets),
giving the estimate $\hat{\Theta}^{\mathrm{MLE}}$. A new score, analogous to
(3.3), is given by
$\log P(L^{ij} = 1 \mid X^{ij}, \hat{\Theta}_S^{\mathrm{MLE}}) -
\log P(L^{ij} = 1 \mid X^{ij}, \hat{\Theta}^{\mathrm{MLE}})$, that is, we
do not integrate out the parameters or use a prior, but instead the models
are fixed at their respective estimates.
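
The NNS and MLS scores can be sketched as follows, assuming that the parameter estimates are already available. Fitting the logistic regressions is outside the sketch, the function names are hypothetical, and the link probability is taken to be the logistic function of the inner product between features and parameters.

```python
import math


def nns_score(candidate, query_set):
    """Minimum Euclidean distance from the candidate to any point in S.

    Smaller values mean the candidate is closer to the query set.
    """
    return min(math.dist(candidate, s) for s in query_set)


def log_sigmoid(t):
    """Numerically stable log of the logistic function."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def mls_score(x, theta_query, theta_all):
    """log P(L=1 | x, theta_query) - log P(L=1 | x, theta_all)."""
    dot_q = sum(a * b for a, b in zip(x, theta_query))
    dot_a = sum(a * b for a, b in zip(x, theta_all))
    return log_sigmoid(dot_q) - log_sigmoid(dot_a)
```

A candidate ranks high under NNS when its score is small, and high under MLS when the query-specific model assigns it a larger link probability than the model fit to the whole data set.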

Neither COS nor NNS can be interpreted as a measure of analogical
similarity, in the sense that they do not take into account how the protein
pair features X contribute to their interaction (see footnote 7). It is true
that a direct measure of analogical similarity is not theoretically required
to perform well according to our (nonanalogical) evaluation metric. However,
we will see that there are practical advantages in using one.

5.2. Results on the MIPS collection of physical interactions. For this
batch of experiments, we use the MIPS network of protein-protein
interactions to define the relationships. In the initial experiment, we
selected queries from all combinations of MIPS classes for which there were
at least 50 linked pairs $P_i : P_j$ in the network that satisfied the choice
of classes. Each query set contained 15 pairs. After removing the
MIPS-categorized proteins for which we had no feature data, we ended up with
a total of 6125 proteins and 7788 positive interactions. We set the prior for
RBSets using a sample of 225,842 pairs labeled as having no interaction, as
selected by Qi, Bar-Joseph and Klein-Seetharaman (2006).

For each tentative query set S of categories $\{M_1, M_2\}$, we scored and
ranked pairs $P_i : P_j$ such that both $P_i$ and $P_j$ were connected to
some protein appearing in S by a path of no more than two steps, according
to the MIPS network. The reasons for the filtering are two-fold: to increase
the computational performance of the ranking, since fewer pairs are scored;
and to minimize the chance that undesirable pairs would appear in the top 10
ranked pairs. Tentative queries were not performed if after filtering we
obtained fewer than 50 possible correct matches. Trivial queries, where
filtering resulted only in pairs in the same class as the query, were also
discarded. The resulting number of unique pairs of categories
$\{M_1, M_2\}$ was 931 classes of interactions. For each pair of categories,
we sampled our query set S 5 times, generating a total of 4655 rankings per
algorithm.
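
The neighborhood filter can be sketched as a bounded breadth-first search. The sketch treats the interaction network as undirected and leaves the query pairs themselves out of the candidate list; both choices, like the function names, are assumptions made for illustration.

```python
from collections import deque


def within_steps(adjacency, sources, max_steps=2):
    """Proteins reachable from any source in at most max_steps links."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if dist[node] == max_steps:
            continue  # do not expand beyond the allowed path length
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return set(dist)


def candidate_pairs(links, query_pairs, max_steps=2):
    """Linked pairs whose two proteins both lie near the query set."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    sources = {p for pair in query_pairs for p in pair}
    near = within_steps(adjacency, sources, max_steps)
    query = {frozenset(pair) for pair in query_pairs}
    return [(a, b) for a, b in links
            if a in near and b in near and frozenset((a, b)) not in query]
```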

We ran two types of experiments. In one version, we give RBSets the
data containing only the 45 (continuous) microarray measurements. In the
second variation, we provide RBSets with all 61 variables, including the 16
sparse binary indicators. We noticed that the addition of the 16 binary
variables hurts RBSets considerably. We conjecture that one reason might be
the degradation of the variational approximation. Including the binary
variables hardly changed the other three methods, so we chose to use the
61-variable data set for the other methods (see footnote 8).

7 As a consequence, neither uses negative data. Another consequence is the
necessity of modeling the input space that generates X, a difficult task
given the dimensionality and the continuous nature of the features.

Table 3
Number of times each method wins when querying pairs of MIPS classes using the
MIPS protein-protein interaction network. The first two columns, #AUC and
#TOP10, count the number of times the respective method obtains the best score
according to the AUC and TOP10 measures, respectively, among the 4 approaches.
This is divided by the number of replications of each query type (5). The last
two columns, #AUC.S and #TOP10.S, are "smoothed" versions of this statistic: a
method is declared the winner of a round of 5 replications if it obtains the
best score in at least 3 out of the 5 replications. Part (a) shows the results
when only the continuous variables are used by RBSets, and part (b) the results
when the discrete variables are also given to RBSets

Method    #AUC    #TOP10    #AUC.S    #TOP10.S
(a)
COS        240       294       219         277
NNS         42       122        28          75
MLS        105       270        52         198
RBSets     542       556       578         587
(b)
COS        314       356       306         340
NNS         75       146        62         111
MLS        273       329       246         272
RBSets     267       402       245         387

Table 3 summarizes the results of this experiment. We show the number
of times each method wins according to both the AUC and TOP10 criteria.
The number of wins is presented as divided by 5, the number of random
sets generated for each query type $\{M_1, M_2\}$ (notice these numbers do
not need to add up to 931, since ties are possible). Moreover, we also
present "smoothed" versions of this statistic, where we count a method as the
winner for any given $\{M_1, M_2\}$ category if, for the group of 5 queries,
the method obtains the best result in at least 3 of the sets. The motivation
is to smooth out the extra variability added by the particular set of 15
protein pairs for a fixed $\{M_1, M_2\}$. The proposed relational Bayesian
sets method is the clear winner according to all measures when we select only
the continuous variables. For this reason, for the rest of this section all
analysis and experiments will consider only this case.
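
The raw and smoothed win counts can be sketched as follows. The input layout (one dictionary of method scores per replication, grouped by category pair) and the function names are assumptions; ties are credited to every tied method, which is why the raw counts need not add up to the number of category pairs.

```python
def winners(scores):
    """Methods achieving the best score in one replication (ties allowed)."""
    best = max(scores.values())
    return {m for m, s in scores.items() if s == best}


def count_wins(replications_by_query, min_wins=3):
    """Return (raw, smoothed) win counts per method.

    replications_by_query: list of category pairs, each a list of
    replications, each a dict mapping method name to score.
    raw: wins divided by the number of replications of the query.
    smoothed: number of category pairs won in at least min_wins replications.
    """
    raw, smoothed = {}, {}
    for replications in replications_by_query:
        wins = {}
        for scores in replications:
            for method in winners(scores):
                wins[method] = wins.get(method, 0) + 1
        for method, w in wins.items():
            raw[method] = raw.get(method, 0.0) + w / len(replications)
            if w >= min_wins:
                smoothed[method] = smoothed.get(method, 0) + 1
    return raw, smoothed
```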



8 We also performed an experiment (not included) where only the continuous attributes 
were used by the other methods. The advantage of RBSets still increased, slightly (by 
a 2% margin against the cosine distance method). For this reason, we analyze the most 
pessimistic case. 






SILVA, HELLER, GHAHRAMANI AND AIROLDI 



Table 4
Pairwise comparison of methods according to the AUC and TOP10 criteria. Each
cell shows the proportion of the trials where the method in the respective row
wins over the method in the column, according to both criteria. In each cell,
the proportion is calculated with respect to the 4655 rankings where no tie
happened

                     AUC                             TOP10
          COS    NNS    MLS    RBSets     COS    NNS    MLS    RBSets
COS        -     0.67   0.43   0.30        -     0.70   0.46   0.30
NNS       0.32    -     0.18   0.06       0.29    -     0.25   0.11
MLS       0.56   0.81    -     0.25       0.53   0.74    -     0.28
RBSets    0.69   0.93   0.74    -         0.69   0.88   0.71    -

Table 4 displays a pairwise comparison of the methods. In this table we 
show how often the row method performs better than the column method, 
among those trials where there was no tie. Again, RBSets dominates. 

Another useful summary is the distribution of correct hits in the top 10 
ranked elements across queries. This provides a measure of the difficulty of 
the problem, besides the relative performance of each algorithm. In Table 5 
we show the proportion of correct hits among the top 10 for each algorithm 
for our queries using MIPS categorization and also GO categorization, as 
explained in the next section. About 14% of the time, all of the top 10
pairs ranked by RBSets were of the intended type, compared to 8% for the
second best approach.



Table 5
Distribution across all queries of the number of hits in the top 10 pairs, as
ranked by each algorithm. The more skewed to the right, the better. Notice that
using GO categories doubles the number of zero hits for RBSets

          0     1     2     3     4     5     6     7     8     9     10

Proportion of top hits using MIPS categories and links specified by the MIPS
database
COS      0.12  0.15  0.12  0.10  0.08  0.07  0.06  0.05  0.04  0.07  0.08
NNS      0.29  0.16  0.14  0.10  0.06  0.05  0.03  0.03  0.03  0.03  0.02
MLS      0.12  0.12  0.12  0.10  0.09  0.08  0.07  0.06  0.07  0.06  0.07
RBSets   0.04  0.08  0.09  0.09  0.09  0.08  0.09  0.07  0.09  0.08  0.14

Proportion of top hits using GO categories and links specified by the MIPS
database
COS      0.12  0.13  0.11  0.10  0.11  0.09  0.06  0.06  0.04  0.06  0.06
NNS      0.53  0.23  0.07  0.02  0.02  0.02  0.04  0.01  0.00  0.00  0.01
MLS      0.16  0.11  0.12  0.10  0.08  0.08  0.08  0.06  0.05  0.06  0.05
RBSets   0.09  0.09  0.10  0.10  0.08  0.08  0.06  0.08  0.08  0.07  0.12

Table 6
Number of times each method wins when querying pairs of GO classes using the
MIPS protein-protein interaction network. Columns #AUC, #TOP10, #AUC.S and
#TOP10.S are defined as in Table 3

Method    #AUC    #TOP10    #AUC.S    #TOP10.S
COS         58        73        58          72
NNS          1        10         0           4
MLS         26        55        13          38
RBSets      93       105       101         110



5.2.1. Changing the categorization system. A variation of this experiment
was performed where the protein categorizations do not come from the
same family as the link network, that is, where we used the MIPS network
but not the MIPS categorization. Instead we performed queries according
to the Gene Ontology categories. Starting from 150 pre-selected GO
categories [Myers et al. (2006)], we once again generated unordered category
pairs $\{M_1, M_2\}$. A total of 179 queries, with 5 replications each (a
total of 895 rankings), were generated; the results are summarized in
Table 6.

This is a more challenging scenario for our approach, which is optimized 
with respect to MIPS. Still, we are able to outperform other approaches. 
Differences are less dramatic, but consistent. In the pairwise comparison of 
RBSets against the second best method, COS, our method wins 62% of the 
time by the TOP10 criterion. 

5.2.2. The role of filtering. In both experiments with the MIPS network,
we filtered candidates by examining only a subset of the proteins linked to
the elements in the query set by a path of no more than two proteins. It is
relevant to evaluate how much coverage of each category pair $\{M_1, M_2\}$
we obtain by this neighborhood selection.

For each query S, we calculate the proportion of pairs $P_i : P_j$ of the
same categorization $\{M_1, M_2\}$ such that both $P_i$ and $P_j$ are
included in the neighborhood. Figure 6 shows the resulting distributions of
such proportions (from 0 to 100%): a histogram for the MIPS search and a
histogram for the GO search. Despite the small neighborhood, coverage is
large. For the MIPS categorization, 93% of the queries resulted in a coverage
of at least 75% (with 24% of the queries resulting in perfect coverage).
Although filtering implies that some valid pairs will never be ranked, the
gain obtained by reducing false positives in the top 10 ranked pairs is
considerable (results not shown) across all methods, and the computational
gain of reducing the search space is particularly relevant in exploratory
data analysis.
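
The coverage of a query can be sketched as a simple proportion. The function name and the representation of the neighborhood as a set of protein identifiers are assumptions of this sketch.

```python
def coverage(valid_pairs, neighborhood):
    """Fraction of valid pairs with both proteins inside the neighborhood.

    valid_pairs: linked pairs of the same categorization as the query.
    neighborhood: set of proteins retained by the filtering step.
    """
    if not valid_pairs:
        raise ValueError("no valid pairs for this categorization")
    reachable = sum(1 for a, b in valid_pairs
                    if a in neighborhood and b in neighborhood)
    return reachable / len(valid_pairs)
```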

5.3. Results on the KEGG collection of signaling pathways. We repeat
the same experimental setup, now using the KEGG network to define the 
protein-protein interactions. We selected proteins from the KEGG catego- 
rization system for which we had data available. A total of 6125 proteins 
were selected. The KEGG network is much more dense than MIPS. A total 
of 38,961 positive pairs and 226,188 negative links were used to generate our 
empirical prior. 

However, since the KEGG network is much more dense than MIPS, we 
filtered our candidate pairs by allowing only proteins that are directly linked 
to the proteins in the query set S. Even under this restriction, we are able 
to obtain high coverage: the neighborhood of 90% of the queries included all 
valid pairs of the same category, and essentially all queries included at least 
75% of the pairs falling in the same category as the query set. A total of 
1523 possible pairs of categories (7615 queries, considering the 5 replications) 
were generated. 

Results are summarized in Table 7. Again, it is evident that RBSets
dominates the other methods. In the pairwise comparison against COS, RBSets
wins 76% of the time according to the TOP10 criterion. However, the
ranking problem in the KEGG network was much harder than in the MIPS 
network (according to our automated nonanalogical criterion). We believe 
that the reason is that, in KEGG, the simple filtering scheme has much less 
influence as reflected by the high coverage. The distribution of the number 
of hits in the top 10 ranked items is shown in Table 8. Despite the success 
of RBSets relative to the other algorithms, there is room for improvement. 



[Figure 6: histograms titled "Distribution of Average Proportion of Reachable
Links (Links Specified by the MIPS Database)"; horizontal axis: recall, in
bins [0,60], (60,70], (70,80], (80,90], (90,100].]

Fig. 6. Distribution of the coverage of valid pairs in the MIPS network,
according to our generated query sets. Results are broken into the two
categorization systems (MIPS and GO) used in this experiment.

6. More related work. There is a large literature on analogical reasoning
in artificial intelligence and psychology. We refer to French (2002) for a sur- 
vey, and to more recent papers on clustering [Marx et al. (2002)], prediction 
[Turney and Littman (2005); Turney (2008a)] and dimensionality reduction 
[Memisevic and Hinton (2005)] as examples of other applications. Classical 
approaches for planning have also exploited analogical similarities [Veloso 
and Carbonell (1993)]. 

Nonprobabilistic similarity functions between relational structures have 
also been developed for the purpose of deriving kernel matrices, such as 
those required by support vector machines. Borgwardt (2007) provides a 
comprehensive survey and state-of-the-art methods. It would be interesting 
to adapt such methods to problems of analogical reasoning. 

The graphical model formulation of Getoor et al. (2002) incorporates 
models of link existence in relational databases, an idea used explicitly in 
Section 3 as the first step of our problem formulation. In the clustering 
literature, the probabilistic approach of Kemp et al. (2006) is motivated by 
principles similar to those in our formulation: the idea is that there is an 
infinite mixture of subpopulations that generates the observed relations. Our 
problem, however, is to retrieve other elements of a subpopulation described 
by elements of a query set, a goal that is closer to the classical paradigm of 
analogical reasoning. 

Table 7
Number of times each method wins when querying pairs of KEGG classes using the
KEGG protein-protein interaction network. Columns #AUC, #TOP10, #AUC.S and
#TOP10.S are defined as in Table 3

Method    #AUC    #TOP10    #AUC.S    #TOP10.S
COS        159       575       134         507
NNS         30       305        17         227
MLS        290       506       199         431
RBSets    1042      1091      1107        1212



Table 8

Distribution across all queries of the number of hits in the top 10 pairs, as ranked by
each algorithm. The more skewed to the right, the better

Proportion of top hits using KEGG categories and links specified by the KEGG database

Method    0     1     2     3     4     5     6     7     8     9     10
COS       0.56  0.21  0.08  0.03  0.02  0.01  0.01  0.01  0.01  0.01  0.01
NNS       0.89  0.03  0.04  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00
MLS       0.57  0.21  0.08  0.04  0.02  0.01  0.01  0.00  0.00  0.00  0.00
RBSets    0.29  0.24  0.16  0.09  0.06  0.03  0.02  0.01  0.03  0.02  0.01

As discussed in Section 3.2, our model can be interpreted as a type of
block model [Kemp et al. (2006); Xu et al. (2006); Airoldi et al. (2008)] with 
observable features. Link indicators are independent given the object fea- 
tures, which might not actually be the case for particular choices of feature 
space. In theory, block models sidestep this issue by learning all the neces- 
sary latent features that account for link dependence. An important future 
extension of our work would consist of tractably modeling the residual link 
association that is not accounted for by our observed features. 
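
The conditional-independence assumption discussed above can be made concrete with a small sketch. It assumes, purely for illustration, that each link indicator follows a logistic regression on a feature vector describing the pair; the function name `link_log_likelihood`, the parameter name `theta` and the toy numbers are hypothetical and are not taken from the Java supplement. Under this assumption the joint log-likelihood of all link indicators is a plain sum of per-pair terms, which is exactly what a block model with latent features would not assume.

```python
import numpy as np

def link_log_likelihood(pair_features, links, theta):
    """Log-likelihood of binary link indicators that are assumed to be
    conditionally independent given observed pair features.

    pair_features: (n_pairs, d) array, one feature vector per pair (assumed given)
    links:         (n_pairs,) array of 0/1 link indicators
    theta:         (d,) parameter vector of an illustrative logistic model
    Returns (total log-likelihood, per-pair contributions).
    """
    pair_features = np.asarray(pair_features, dtype=float)
    links = np.asarray(links, dtype=float)
    logits = pair_features @ np.asarray(theta, dtype=float)
    # log sigmoid(t) = -log(1 + exp(-t)); logaddexp keeps this numerically stable
    log_p_link = -np.logaddexp(0.0, -logits)
    log_p_no_link = -np.logaddexp(0.0, logits)
    per_pair = links * log_p_link + (1.0 - links) * log_p_no_link
    # Independence given features: the joint factorizes, so the logs add up
    return per_pair.sum(), per_pair

# Toy data (hypothetical): three pairs, two features each
features = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
observed_links = np.array([1, 0, 1])
total, per_pair = link_log_likelihood(features, observed_links, theta=[0.1, 0.8])
print(total, per_pair)
```

Any residual dependence between links that the features fail to explain is invisible to this factorized form, which is the limitation the paragraph above points out.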

Discovering analogies is a specific task within the general problem of gen- 
erating latent relationships from relational data. Some of the first formal 
methods for discovering latent relationships from multiple data sets were in- 
troduced in the literature of inductive logic programming, such as the inverse 
resolution method [Muggleton (1981)]. A more recent probabilistic method 
is discussed by Kok and Domingos (2007). Dzeroski and Lavrac (2001) and 
Getoor and Taskar (2007) provide an overview of relational learning methods 
from a data mining and machine learning perspective. 

A particularly active subfield on latent relationship generation lies within 
text analysis research. For instance, Stephens et al. (2001) describe an ap- 
proach for discovering relations between genes given MEDLINE abstracts. 
In the context of information retrieval, Cafarella, Banko and Etzioni (2006) 
describe an application of recent unsupervised information extraction meth- 
ods: relations generated from unstructured text documents are used as a 
preprocessing step to build an index of web pages. In analogical reasoning 
applications, our method has been used by others for question-answering
analysis [Wang et al. (2009)]. 

The idea of measuring the similarity of two data points based on a predic- 
tive function has appeared in the literature on matching for causal inference. 
Suppose we are given a model for predicting an outcome Y given a treatment 
Z and a set of potential confounders X. For simplicity, assume $Z \in \{0,1\}$.
The goal of matching is to find, for each data point $(X_i, Y_i, Z_i)$, the closest
match $(X_j, Y_j, Z_j)$ according to the confounding variables X. In principle,
any clustering criterion could be used in this task [Gelman and Hill (2007)].
The propensity score criterion [Rosenbaum (2002)] measures the similarity
of two feature vectors $X_i$ and $X_j$ by comparing the predictions
$P(Z_i = 1 \mid X_i)$ and $P(Z_j = 1 \mid X_j)$. If the conditional $P(Z = 1 \mid X)$ is
given by a logistic regression model with parameter vector $\theta$, Gelman and
Hill (2007) suggest measuring the difference between $X_i^{\top}\theta$ and
$X_j^{\top}\theta$. While this is not the same
as comparing two predictive functions as in our framework, the core idea of
using predictive functions to define similarity remains. 
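
The matching idea described above can be sketched in a few lines. This is a minimal illustration, not a method from this paper: it assumes a logistic regression parameter vector (called `theta` here) has already been fitted, matches each treated unit to the control unit with the closest linear predictor, and allows a control unit to be reused. The function name `linear_predictor_matches` and the toy data are hypothetical.

```python
import numpy as np

def linear_predictor_matches(X, z, theta):
    """Match each treated unit (z == 1) to the control unit (z == 0) whose
    linear predictor X @ theta is closest. Matching is with replacement."""
    X = np.asarray(X, dtype=float)
    z = np.asarray(z, dtype=int)
    scores = X @ np.asarray(theta, dtype=float)   # one linear predictor per unit
    treated = np.flatnonzero(z == 1)
    control = np.flatnonzero(z == 0)
    # Absolute gap between every treated score and every control score
    gaps = np.abs(scores[treated, None] - scores[None, control])
    nearest = gaps.argmin(axis=1)                 # best control for each treated unit
    return {int(i): int(control[j]) for i, j in zip(treated, nearest)}

# Toy data (hypothetical): an intercept column plus one confounder
X = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, 0.3], [1.0, 1.4]])
z = np.array([1, 1, 0, 0])
theta = np.array([-0.5, 2.0])                     # assumed to be fitted elsewhere
matches = linear_predictor_matches(X, z, theta)
print(matches)
```

Because the logistic function is monotone, units that are close in linear predictor are also close in predicted probability of treatment, so similarity here is defined through a single fitted predictive function rather than through the raw feature vectors.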

A preliminary version of this paper appeared in the proceedings of the 
11th International Conference on Artificial Intelligence and Statistics [Silva, 
Heller and Ghahramani (2007)]. 

7. Conclusion. We have presented a framework for performing analogical
reasoning within a Bayesian data analysis formulation. There is of course
much more to analogical reasoning than calculating the similarity of related 
pairs. As future work, we will consider hierarchical models that could in 
principle compare relational structures (such as protein complexes) of dif- 
ferent sizes. In particular, the literature on graph kernels [Borgwardt (2007)] 
could provide insights on developing efficient similarity metrics within our 
probabilistic framework.

Also, we would like to combine the properties of the mixed-membership 
stochastic block model of Airoldi et al. (2008), where objects are clustered 
into multiple roles according to the relationship matrix Cab, with our framework
where relationship indicators are conditionally independent given observed
features.

Finally, we would like to consider the case where multiple relationship 
matrices are available, allowing for the comparison of relational structures 
with multiple types of objects. 

Much remains to be done to create a complete analogical reasoning sys- 
tem, but the described approach has immediate applications to information 
retrieval and exploratory data analysis. 

SUPPLEMENTARY MATERIAL

Supplement: Java implementation of the Relational Bayesian Sets method
(DOI: 10.1214/09-AOAS321SUPP; .zip). We provide complete source code
for our method, and instructions on how to rebuild our experiments. With 
the code it is also possible to test variations of our queries, analyzing the 
sensitivity of the results to different query sizes and initialization of the 
variational optimizer. 